AITopics | reading order

NVIDIA Nemotron Parse 1.1

Chumachenko, Kateryna, Deshmukh, Amala Sanjay, Seppanen, Jarno, Karmanov, Ilia, Chen, Chia-Chih, Voegtle, Lukas, Fischer, Philipp, Wawrzos, Marek, Motiian, Saeid, Ageev, Roman, Wu, Kedi, Milesi, Alexandre, Moosaei, Maryam, Pawelec, Krzysztof, Subramanian, Padmavathy, Samadi, Mehrzad, Yu, Xin, Dear, Celina, Stoddard, Sarah, Diamond, Jenna, Oliver, Jesse, Chraghchian, Leanna, Skelly, Patrick, Balough, Tom, Xu, Yao, Scowcroft, Jane Polak, Korzekwa, Daniel, Hanley, Darragh, Bhaskar, Sandip, Roman, Timo, Sapra, Karan, Tao, Andrew, Catanzaro, Bryan

arXiv.org Artificial Intelligence, Nov-26-2025

We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts bounding boxes of text segments, as well as corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks, making it a strong lightweight OCR solution. We release the model weights publicly on Hugging Face, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC, which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.

large language model, machine learning, natural language, (19 more...)

2511.20478

Genre: Research Report (0.40)

Industry: Information Technology > Hardware (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Report

Chen, Yan, Zou, Yu, Zeng, Jialei, You, Haoran, Zhou, Xiaorui, Zhong, Aixi

arXiv.org Artificial Intelligence, Nov-21-2025

Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial governance, transforming capital allocation architectures, regulatory frameworks, and systemic risk coordination mechanisms. However, as the core medium for assessing corporate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic reading order from slide-like irregular layouts and implicit hierarchies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a unified framework that transforms ESG reports into structured representations through multimodal parsing, contextual narration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware segmentation guided by table-of-contents anchors, and a multimodal aggregation pipeline that contextually transforms visual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical demands of financial research. Extensive experiments on annotated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG reports, spanning Mainland China, Hong Kong, and U.S. markets, featuring unified structured representations of multimodal content, enriched with fine-grained layout and semantic annotations to better support ESG integration in financial governance and decision-making.

large language model, machine learning, natural language, (20 more...)

2511.16417

Country: Asia > China > Hong Kong (0.25)

Genre: Research Report (1.00)

Industry: Banking & Finance (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Evaluating Multimodal Large Language Models on Vertically Written Japanese Text

Sasagawa, Keito, Kurita, Shuhei, Kawahara, Daisuke

arXiv.org Artificial Intelligence, Nov-20-2025

Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what is written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced from real-world document images containing vertically written Japanese text. Using these datasets, we demonstrate that existing MLLMs perform worse on vertically written Japanese text than on horizontally written Japanese text. Furthermore, we show that training MLLMs on our synthesized Japanese OCR dataset improves the performance of models that previously could not handle vertical writing. The datasets and code are publicly available at https://github.com/llm-jp/eval_vertical_ja.

large language model, machine learning, natural language, (21 more...)

2511.15059

Country:

North America > United States (0.68)
Europe (0.67)
Asia > Japan > Honshū (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

SCORE: A Semantic Evaluation Framework for Generative Document Parsing

Li, Renyu, Yepes, Antonio Jimeno, You, Yao, Pluciński, Kamil, Operlejn, Maximilian, Wolfe, Crag

arXiv.org Artificial Intelligence, Sep-25-2025

Traditional document parsing architectures employ deterministic pipelines that sequentially combine optical character recognition (OCR), layout analysis, and rule-based table extraction to produce structured outputs. The evaluation of these systems has relied on well-established task-specific metrics including Character Error Rate (CER) and Word Error Rate (WER) [14, 20], Intersection-over-Union (IoU) [4, 16], and Tree Edit Distance-based Similarity (TEDS) [31]. These metrics operate under the assumption of unique ground truth representations, rewarding exact matches while systematically penalizing any structural deviations. The emergence of multi-modal generative document parsing systems has fundamentally transformed this landscape. Vision Language Models (VLMs) such as GPT-5 Mini, Gemini 2.5 Flash, and Claude Sonnet 3.7/4 [22, 6, 1, 2], generate holistic document interpretations that integrate visual, textual, and structural signals in an end-to-end manner. Unlike their deterministic predecessors, these systems frequently produce outputs that are semantically correct yet structurally divergent. Consider a table containing merged cells: one system may represent it as a flattened token sequence preserving reading order, while another generates hierarchical HTML markup with explicit structural relationships. Both interpretations faithfully capture the semantic content, yet traditional evaluation frameworks treat them as fundamentally incompatible, systematically misclassifying valid alternative interpretations as parsing errors. This mismatch of the evaluation paradigm has significant practical implications.
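For reference, the traditional metrics named in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SCORE framework or any benchmark's official scorer: the function names `edit_distance`, `cer`, `wer`, and `iou` are hypothetical, CER and WER are computed as Levenshtein distance divided by reference length, and boxes are assumed to be axis-aligned `(x0, y0, x1, y1)` tuples.

```python
from typing import Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Under this sketch, two serializations that contain the same table cells in a different order, such as `"a b c d"` and `"a c b d"`, receive a nonzero WER even though no content is lost, which is the kind of penalty on structurally divergent output that the abstract describes.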

large language model, machine learning, natural language, (22 more...)

2509.19345

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(2 more...)

Toward a Metrology for Artificial Intelligence: Hidden-Rule Environments and Reinforcement Learning

Mathew, Christo, Wang, Wentian, Feldman, Jacob, Gallos, Lazaros K., Kantor, Paul B., Menkov, Vladimir, Wang, Hao

arXiv.org Machine Learning, Sep-10-2025

We investigate reinforcement learning in the Game Of Hidden Rules (GOHR) environment, a complex puzzle in which an agent must infer and execute hidden rules to clear a 6$\times$6 board by placing game pieces into buckets. We explore two state representation strategies, namely Feature-Centric (FC) and Object-Centric (OC), and employ a Transformer-based Advantage Actor-Critic (A2C) algorithm for training. The agent has access only to partial observations and must simultaneously infer the governing rule and learn the optimal policy through experience. We evaluate our models across multiple rule-based and trial-list-based experimental setups, analyzing transfer effects and the impact of representation on learning efficiency.

experiment, quadnearby, representation, (16 more...)

2509.06213

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > California (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Complete Evasion, Zero Modification: PDF Attacks on AI Text Detection

Creo, Aldan

arXiv.org Artificial Intelligence, Aug-5-2025

AI-generated text detectors have become essential tools for maintaining content authenticity, yet their robustness against evasion attacks remains questionable. We present PDFuzz, a novel attack that exploits the discrepancy between visual text layout and extraction order in PDF documents. Our method preserves exact textual content while manipulating character positioning to scramble extraction sequences. We evaluate this approach against the ArguGPT detector using a dataset of human and AI-generated text. Our results demonstrate complete evasion: detector performance drops from (93.6 $\pm$ 1.4) % accuracy and 0.938 $\pm$ 0.014 F1 score to random-level performance ((50.4 $\pm$ 3.2) % accuracy, 0.0 F1 score) while maintaining perfect visual fidelity. Our work reveals a vulnerability in current detection systems that is inherent to PDF document structures and underscores the need for implementing sturdy safeguards against such attacks. We make our code publicly available at https://github.com/ACMCMC/PDFuzz.
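The discrepancy the abstract above relies on, that the order in which characters are stored need not match the order in which they appear on the page, can be illustrated with a toy model. The sketch below does not touch real PDF files and is not the PDFuzz code: `Glyph`, `layout`, `scramble_stream`, and the two extractor functions are hypothetical names, and the "page" is simply a list of characters with coordinates.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    x: int  # column on the page
    y: int  # line index; larger y is lower on the page


def layout(text: str, width: int = 20) -> list[Glyph]:
    """Place each character on a fixed-width grid, left to right, top to bottom."""
    return [Glyph(c, x=i % width, y=i // width) for i, c in enumerate(text)]


def scramble_stream(glyphs: list[Glyph], seed: int = 0) -> list[Glyph]:
    """Reorder the stored glyphs without changing any position on the page."""
    shuffled = list(glyphs)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def extract_stream_order(glyphs: list[Glyph]) -> str:
    """Naive extractor: read characters in storage order."""
    return "".join(g.char for g in glyphs)


def extract_visual_order(glyphs: list[Glyph]) -> str:
    """Position-aware extractor: sort by line, then by column."""
    return "".join(g.char for g in sorted(glyphs, key=lambda g: (g.y, g.x)))


text = "Detectors read extracted text, not pixels."
page = scramble_stream(layout(text))
print(extract_stream_order(page))   # same characters, scrambled sequence
print(extract_visual_order(page))   # original sentence recovered from positions
```

In this toy setting every glyph keeps its coordinates, so the rendered page is unchanged, while an extractor that trusts storage order sees a different character sequence. Sorting by position recovers the text here only because the grid is regular; real documents with columns, rotation, or overlapping text need a proper reading-order model.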

detector, machine learning, natural language, (18 more...)

2508.01887

Country: Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.73)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.59)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.49)

MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning

Feng, Zhaopeng, Liang, Yupu, Cao, Shaosheng, Su, Jiayuan, Ren, Jiahan, Xu, Zhe, Hu, Yao, Huang, Wenxuan, Wu, Jian, Liu, Zuozhu

arXiv.org Artificial Intelligence, May-27-2025

Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT's intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduce XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.

large language model, machine learning, translation, (18 more...)

2505.19714

Country: Asia (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation

Liu, Xiaochuan, Song, Ruihua, Wang, Xiting, Chen, Xu

arXiv.org Artificial Intelligence, May-27-2025

Automatic related work generation (RWG) can save people time and effort when drafting a related work section (RWS) for further revision. However, existing RWG methods suffer from shallow comprehension, because they take only limited portions of the reference papers as input, and from isolated explanations of each reference, because they fail to capture the relationships among them. To address these issues, we focus on the full-text-based RWG task and propose a novel multi-agent framework. Our framework consists of three agents: a selector that decides which section of the papers to read next, a reader that digests the selected section and updates a shared working memory, and a writer that generates the RWS based on the final curated memory. To better capture the relationships among references, we also propose two graph-aware strategies for the selector, enabling it to optimize the reading order under the constraints of the graph structure. Extensive experiments demonstrate that our framework consistently improves performance across three base models and various input configurations. The graph-aware selectors outperform alternative selectors, achieving state-of-the-art results. The code and data are available at https://github.com/1190200817/Full_Text_RWG.

machine learning, natural language, selector, (17 more...)

2505.19647

Country: Asia > Middle East > UAE (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding

Zhang, Chong, Tu, Yi, Zhao, Yixi, Yuan, Chenshu, Chen, Huan, Zhang, Yue, Chai, Mingxu, Guo, Ya, Zhu, Huijia, Zhang, Qi, Gui, Tao

arXiv.org Artificial Intelligence, Sep-29-2024

Modeling and leveraging layout reading order in visually-rich documents (VrDs) is critical in document intelligence as it captures the rich structural semantics within documents. Previous works typically formulated layout reading order as a permutation of layout elements, i.e., a sequence containing all the layout elements. However, we argue that this formulation does not adequately convey the complete reading order information in the layout, which may lead to performance decline in downstream VrD tasks. To address this issue, we propose to model the layout reading order as ordering relations over the set of layout elements, which have sufficient expressive capability for the complete reading order information. To enable empirical evaluation of methods for the improved form of reading order prediction (ROP), we establish a comprehensive benchmark dataset including the reading order annotation as relations over layout elements, together with a relation-extraction-based method that outperforms previous methods. Moreover, to highlight the practical benefits of introducing the improved form of layout reading order, we propose a reading-order-relation-enhancing pipeline to improve model performance on arbitrary VrD tasks by introducing additional reading order relation inputs. Comprehensive results demonstrate that the pipeline generally benefits downstream VrD tasks: (1) by utilizing the reading order relation information, the enhanced downstream models achieve SOTA results on both task settings of the targeted dataset; (2) by utilizing the pseudo reading order information generated by the proposed ROP model, the performance of the enhanced models improves across all three models and eight cross-domain VrD-IE/QA task settings without targeted optimization.
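The contrast drawn in the abstract above, between reading order as a permutation and reading order as ordering relations, can be made concrete with a small data-structure sketch. This is an illustration under assumptions, not the paper's dataset format or method: the element names and the helpers `permutation_to_relations` and `is_single_chain` are hypothetical, and a relation is assumed to be a directed pair `(predecessor, successor)`.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple

Relation = Tuple[Hashable, Hashable]  # assumed form: (predecessor, successor)


def permutation_to_relations(order: Sequence[Hashable]) -> Set[Relation]:
    """A permutation is the special case where relations form one chain."""
    return set(zip(order, order[1:]))


def is_single_chain(elements: Iterable[Hashable], relations: Set[Relation]) -> bool:
    """True if the relations describe exactly one linear sequence over all elements."""
    elements = set(elements)
    nxt, has_pred = {}, set()
    for a, b in relations:
        if a in nxt or b in has_pred:
            return False  # an element has two successors or two predecessors
        nxt[a] = b
        has_pred.add(b)
    starts = [e for e in elements if e not in has_pred]
    if len(starts) != 1:
        return False
    node, seen = starts[0], {starts[0]}
    while node in nxt:
        node = nxt[node]
        seen.add(node)
    return seen == elements


elements = ["title", "header", "cell_a", "cell_b"]

# Permutation view: forces an order between the two cells.
as_permutation = permutation_to_relations(elements)

# Relation view: the header is read before each cell, and no order
# between the cells themselves is asserted.
as_relations = {("title", "header"), ("header", "cell_a"), ("header", "cell_b")}

print(is_single_chain(elements, as_permutation))
print(is_single_chain(elements, as_relations))
```

In this sketch the branching structure under `header` cannot be written as a single sequence without adding an ordering between `cell_a` and `cell_b` that the layout does not imply, which is one way to read the abstract's claim that a permutation does not convey the complete reading order information.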

information, layout element, reading order, (13 more...)

2409.19672

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Shanghai > Shanghai (0.04)
Asia > Singapore (0.04)
(10 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(2 more...)

Computer Vision Intelligence Test Modeling and Generation: A Case Study on Smart OCR

Shu, Jing, Miu, Bing-Jiun, Chang, Eugene, Gao, Jerry, Liu, Jun

arXiv.org Artificial Intelligence, Sep-14-2024

AI-based systems possess distinctive characteristics and introduce challenges in quality evaluation at the same time. Consequently, ensuring and validating AI software quality is of critical importance. In this paper, we present an effective AI software functional testing model to address this challenge. Specifically, we first present a comprehensive literature review of previous work, covering key facets of AI software testing processes. We then introduce a 3D classification model to systematically evaluate the image-based text extraction AI function, as well as test coverage criteria and complexity. To evaluate the performance of our proposed AI software quality test, we propose four evaluation metrics to cover different aspects. Finally, based on the proposed framework and defined metrics, a mobile Optical Character Recognition (OCR) case study is presented to demonstrate the framework's effectiveness and capability in assessing AI function quality.

accuracy, machine learning, natural language, (19 more...)

doi: 10.1109/AITest62860.2024.00011

2410.03536

Country: North America > United States > California > Santa Clara County > San Jose (0.05)

Genre: Research Report (1.00)

Industry: Education > Assessment & Standards > Measuring Intelligence (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.90)
